Boundary-based MWE segmentation with text partitioning

Author

Jake Williams

Abstract

In this article we present a novel algorithm for the task of comprehensively segmenting texts into MWEs. Because the basis for this algorithm (referred to as text partitioning) was developed only recently, these results constitute its first performance-evaluated application to a natural language processing task. A differentiating feature of this single-parameter model is its focus on gap (i.e., punctuation) crossings as features for MWE identification, which uses substantially more information in training than is present in dictionaries. We show that this algorithm is capable of achieving high levels of precision and recall using only type-level information, and then extend it to include part-of-speech tags to increase its performance to state-of-the-art levels, despite a simple decision criterion and a general feature space (which makes the method directly applicable to other languages). Since the existence of comprehensive MWE annotations is what drives this segmentation algorithm, these results support their continued production. In addition, we have updated and extended the strength-averaging evaluation scheme, allowing for a more accurate and fine-grained understanding of model performance, and leading us to affirm, quantitatively, the differences in nature and identifiability between weakly- and strongly-linked MWEs.
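
As a rough illustration of what a single-parameter, boundary-based segmenter might look like, the sketch below estimates from annotated data how often the gap between two word types falls inside an MWE, and then keeps a gap bound whenever that estimate exceeds a threshold. This is not the author's implementation: the function names, the word-pair feature, the `(tokens, bound_flags)` input format, and the threshold name `q` are all assumptions made for this example, and the abstract does not specify the actual decision criterion.

```python
from collections import defaultdict


def train_gap_counts(annotated_sentences):
    """Count, per (left word, right word) pair, how often the gap is bound.

    Each training item is (tokens, bound_flags), where bound_flags[i] is True
    when the gap between tokens[i] and tokens[i + 1] lies inside an annotated
    MWE. This input format is a hypothetical one chosen for the sketch.
    """
    counts = defaultdict(lambda: [0, 0])  # pair -> [bound count, total count]
    for tokens, bound_flags in annotated_sentences:
        for i, is_bound in enumerate(bound_flags):
            pair = (tokens[i].lower(), tokens[i + 1].lower())
            counts[pair][1] += 1
            if is_bound:
                counts[pair][0] += 1
    return counts


def segment(tokens, counts, q=0.5):
    """Split tokens into segments, keeping a gap bound when its estimate > q."""
    if not tokens:
        return []
    segments = [[tokens[0]]]
    for left, right in zip(tokens, tokens[1:]):
        bound, total = counts.get((left.lower(), right.lower()), (0, 0))
        p_bound = bound / total if total else 0.0  # unseen pairs are split
        if p_bound > q:
            segments[-1].append(right)  # gap stays inside the current segment
        else:
            segments.append([right])    # gap becomes a segment boundary
    return segments


training = [
    (["New", "York", "is", "big"], [True, False, False]),
    (["New", "York", "rocks"], [True, False]),
]
counts = train_gap_counts(training)
result = segment(["I", "love", "New", "York"], counts, q=0.5)
```

With this toy training set, `result` groups "New York" into one segment and leaves the other tokens as singletons. Raising `q` makes the segmenter more conservative about binding gaps, which is the kind of precision/recall trade-off a single threshold parameter would control.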

Similar resources

Discriminative Lexical Semantic Segmentation with Gaps: Running the MWE Gamut

We present a novel representation, evaluation measure, and supervised models for the task of identifying the multiword expressions (MWEs) in a sentence, resulting in a lexical semantic segmentation. Our approach generalizes a standard chunking representation to encode MWEs containing gaps, thereby enabling efficient sequence tagging algorithms for feature-rich discriminative models. Experiments ...

Application of the Tightness Continuum Measure to Chinese Information Retrieval

Most word segmentation methods employed in Chinese Information Retrieval systems are based on a static dictionary or a model trained against a manually segmented corpus. These general segmentation approaches may not be optimal because they disregard information within semantic units. We propose a novel method for improving word-based Chinese IR, which performs segmentation according to the tigh...

A Modified Character Segmentation Algorithm for Farsi Printed Text Using Upper Contour Labelling

In this paper, a modified segmentation algorithm for printed Farsi words is presented. This algorithm is based on a previous work by Azmi that uses the conditional labeling of the upper contour to find the segmentation points. The main objective is to improve the segmentation results for low quality prints. To achieve this, various modifications on local baseline detection, contour labeling an...

Impact of MWE Resources on Multiword Recognition

In this paper, we demonstrate the impact of Multiword Expression (MWE) resources in the task of MWE recognition in text. We present results based on the Wiki50 corpus for MWE resources, generated using unsupervised methods from raw text and resources that are extracted using manual text markup and lexical resources. We show that resources acquired from manual annotation yield the best MWE taggi...
